Introduction to Regular Expressions
This post explains the basics of regular expressions. A regular expression (regexp for short) is a powerful technique for searching and manipulating text. If you plan to build a career in natural language processing, regexp is a must-have skill.
Example: searching a Word document with Ctrl+F is a simple form of the pattern matching that regular expressions generalize.
Let’s take the example below to understand it further.
text = "abc abcd abcde"
We want to see whether the string `abc` is present in our text data. To do that, we will use the most basic form of regular expression, which specifies the search query as the literal word or text itself. To search for `abc` in our text data, we simply use `abc` as our regular expression.
The following code illustrates how to search for `abc` in our text data. We will use Python's `re` package.
import re
# search for the pattern 'abc' in text; the r prefix marks the pattern as a raw string
result = re.findall(r'abc', text)
print(result)
['abc', 'abc', 'abc']
The result above shows three occurrences of the pattern `abc`. This is the most basic form of regular expression: simply using the literal text.
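If the goal is only to check whether a pattern is present, or to find where it occurs, `re.search` and `re.finditer` may be a better fit than `re.findall`. The following is a minimal sketch on the same text; the variable names `found` and `positions` are illustrative choices, not part of the original example.

```python
import re

text = "abc abcd abcde"

# re.search returns a match object for the first occurrence, or None if absent
found = re.search(r'abc', text) is not None

# re.finditer yields one match object per occurrence; span() gives (start, end)
positions = [m.span() for m in re.finditer(r'abc', text)]

print(found)      # True
print(positions)  # [(0, 3), (4, 7), (9, 12)]
```

Each span is a pair of start and end indexes into `text`, with the end index exclusive, as in Python slicing.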
Now, we will move towards a more advanced form of regular expressions. Let's change our text data to the following:
text = "abc abcd abcde bcd apple ddeffe eef ggh"
Let’s now search for all three-character words (no numbers). Our first approach of specifying the word itself only works if we already know every three-character word in the text data. However, we may not know all of them.
How can we search for three-character words in the text?
To answer this, we will use the special character `.`, which matches a single occurrence of any character (except a newline, by default). For our search, we can simply write `.` three times to try to match every three-character word.
Let’s apply it first to see how it works.
import re
text = "abc abcd abcde bcd apple ddeffe eef ggh"
# three dots match any three consecutive characters
result = re.findall(r'...', text)
print(result)
['abc', ' ab', 'cd ', 'abc', 'de ', 'bcd', ' ap', 'ple', ' dd', 'eff', 'e e', 'ef ', 'ggh']
What went wrong? The results are not what we expected (i.e., only the three-character words). The reason is that the text is treated as a sequence of characters, and characters are not limited to letters and digits. Blanks also count as characters. So when we specify `.`, it matches any single character, including a blank space.
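To see this behaviour in isolation, here is a small sketch on made-up strings (the inputs and variable names are assumptions chosen for illustration). It also shows that, by default, `.` does not match a newline unless the `re.DOTALL` flag is passed.

```python
import re

# '.' matches letters, spaces and punctuation alike
any_chars = re.findall(r'.', 'a b!')

# by default '.' skips the newline character
no_newline = re.findall(r'.', 'a\nb')

# with re.DOTALL the newline is matched as well
with_newline = re.findall(r'.', 'a\nb', flags=re.DOTALL)

print(any_chars)     # ['a', ' ', 'b', '!']
print(no_newline)    # ['a', 'b']
print(with_newline)  # ['a', '\n', 'b']
```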
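To correct this, we will use another special sequence, `\b`, which matches a word boundary: the position between a word character and a non-word character (in our case, the edges of words next to the blank spaces, and the start and end of the text). It does not consume any character itself.
Now, we will change our regular expression to `\b...\b`, which matches three characters that stand alone as a word. Let’s see the results.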
import re
text = "abc abcd abcde bcd apple ddeffe eef ggh"
# \b on both sides requires the three characters to sit between word boundaries
result = re.findall(r'\b...\b', text)
print(result)
['abc', 'bcd', 'eef', 'ggh']
It worked now as expected.
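One caveat: `\b...\b` still allows the three characters to be anything, so on other inputs it can pick up spaces. A possibly more robust variant is `\b\w{3}\b`, where `\w` matches a word character and `{3}` repeats it three times. The sketch below compares the two on a hypothetical text with one-letter words; `sample`, `loose` and `strict` are illustrative names.

```python
import re

sample = "abc a b ggh"  # hypothetical text containing one-letter words

# three dots between boundaries also match ' a ' (space, letter, space)
loose = re.findall(r'\b...\b', sample)

# \w{3} restricts the three characters to word characters
strict = re.findall(r'\b\w{3}\b', sample)

print(loose)   # ['abc', ' a ', 'ggh']
print(strict)  # ['abc', 'ggh']
```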
Like the special characters we have seen so far, there are other characters with special meanings. These characters make it easier to create search patterns. The most commonly used special characters are listed below.
Pattern | Description |
---|---|
[abc] | Matches a single character among a,b,c |
[^abc] | Matches a single character except a or b or c |
[a-z] | Matches a single character in a-z |
[^a-z] | Matches a single character except a-z |
. | Matches any single character (except a newline, by default) |
\d | Matches any single digit |
\w | Matches any single word character (i.e., a-z, A-Z, 0-9, or _) |
\s | Matches any whitespace character (e.g., space, tab, newline) |
\b | Matches a word boundary |
^ | Matches start of string |
$ | Matches end of string |
a? | Matches zero or one occurrence of character a |
a+ | Matches one or more occurrences of character a |
a* | Matches zero or more occurrences of character a |
a{4} | Matches four occurrences of character a |
a{2,4} | Matches two to four occurrences of character a |
a{2,} | Matches two or more occurrences of character a |
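
A few of the patterns from the table can be tried out as follows. This is only a sketch: the input strings and variable names are made up for illustration.

```python
import re

# character class: any one of a, b or c
in_class = re.findall(r'[abc]', 'cabbage')

# negated class: any single character except a-z
not_lower = re.findall(r'[^a-z]', 'ab1 C')

# optional character: 'u' may appear zero or one time
optional = re.findall(r'colou?r', 'color colour')

# bounded repetition: runs of two to four 'a' characters
bounded = re.findall(r'a{2,4}', 'a aa aaaaa')

# anchors: the pattern must sit at the start or the end of the string
starts = re.search(r'^abc', 'abcdef') is not None
ends = re.search(r'abc$', 'abcdef') is not None

print(in_class)      # ['c', 'a', 'b', 'b', 'a']
print(not_lower)     # ['1', ' ', 'C']
print(optional)      # ['color', 'colour']
print(bounded)       # ['aa', 'aaaa']
print(starts, ends)  # True False
```

Note that quantifiers are greedy: in `'aaaaa'`, `a{2,4}` takes four characters at once, and the single `a` left over is too short to match again.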
Examples
Let’s now see some examples of using regular expressions.
Example-1: Find all the numbers in the text below.
import re
text = 'Apple Bus 123 Air 34 Data 33 45Egg'
# \d matches any digit, while + matches one or more occurrences of the preceding pattern (i.e., digit)
results = re.findall(r'\d+', text)
# print results
print(results)
['123', '34', '33', '45']
Remember to use `\b` in the expression if you want to extract only standalone numbers, not numbers that are part of a longer string (e.g., 45 in 45Egg).
import re
text = 'Apple Bus 123 Air 34 Data 33 45Egg'
# \d+ matches one or more digits; \b on both sides keeps only standalone numbers
results = re.findall(r'\b\d+\b', text)
# print results
print(results)
['123', '34', '33']
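The examples above always write the pattern with an `r` prefix. The sketch below shows why this matters for `\b`: in a normal Python string literal, `'\b'` is a backspace character, so the regex engine never sees the word-boundary sequence. The variable names here are illustrative.

```python
import re

text = 'Apple Bus 123 Air 34 Data 33 45Egg'

# in a normal string literal '\b' is a single backspace character
plain = '\b'
# in a raw string the backslash is kept, so the regex engine sees \b
raw = r'\b'

# without the r prefix the pattern looks for literal backspaces and finds nothing
without_raw = re.findall('\b\\d+\b', text)
with_raw = re.findall(r'\b\d+\b', text)

print(len(plain), len(raw))  # 1 2
print(without_raw)           # []
print(with_raw)              # ['123', '34', '33']
```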